# import data set
originalDt <- read_csv("diabetes_012_health_indicators_BRFSS2015.csv",show_col_types = FALSE)
The data set to be explored is an adapted version of the 2015 survey results (https://www.kaggle.com/datasets/cdc/behavioral-risk-factor-surveillance-system) from The Behavioral Risk Factor Surveillance System (BRFSS). BRFSS, operated by Centers of Disease Control and Prevention (CDC), is a system that collects data “on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases” via telephone surveys every year. In 2015, the population for the survey is consisted of adult residents from all 50 states of the U.S., as well as the District of Columbia and three U.S. territories.The original data set contains 441,455 observations and 330 variables that recorded responses to questions in the survey, and these variables provide demographic, health related and other calculated data of each participant. In the adapted data set, the author Alex Teboul processed the original data and posted the cleaned data set to Kaggle (https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset). In Teboul’s treatment,important risk factors are selected and refactored for readability, and NA values are dropped. After processing, the data set has 253,680 observations and 22 variables. In both original and adapted data set, the privacy of participants are protected by excluding any identifiable information.
knitr::kable(head(originalDt), col.names = gsub("[.]", " ", names(originalDt)))
| Diabetes_012 | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | Veggies | HvyAlcoholConsump | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 40 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 5 | 18 | 15 | 1 | 0 | 9 | 4 | 3 |
| 0 | 0 | 0 | 0 | 25 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 7 | 6 | 1 |
| 0 | 1 | 1 | 1 | 28 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 5 | 30 | 30 | 1 | 0 | 9 | 4 | 8 |
| 0 | 1 | 0 | 1 | 27 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 11 | 3 | 6 |
| 0 | 1 | 1 | 1 | 24 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 2 | 3 | 0 | 0 | 0 | 11 | 5 | 4 |
| 0 | 1 | 1 | 1 | 25 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 2 | 0 | 1 | 10 | 6 | 8 |
In preliminary examination of the data set, we retrieved detailed explanation of each variable from the official codebook (https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf). It is worth noting that most variables in this data set are discrete except for BMI, which in principle is continuous but reported as integers in this survey.
# get column names of the data set
names(originalDt)
## [1] "Diabetes_012" "HighBP" "HighChol"
## [4] "CholCheck" "BMI" "Smoker"
## [7] "Stroke" "HeartDiseaseorAttack" "PhysActivity"
## [10] "Fruits" "Veggies" "HvyAlcoholConsump"
## [13] "AnyHealthcare" "NoDocbcCost" "GenHlth"
## [16] "MentHlth" "PhysHlth" "DiffWalk"
## [19] "Sex" "Age" "Education"
## [22] "Income"
In terms of the aspect being assessed, variables are categorized into three classes:
target
Although the data set features indicators that help to predict whether a
person has diabetes, it also contains stroke status of the person. In
addition, most of the selected variables in the data set are also risk
factors for stroke. Therefore, we choose stroke as our target (dependent
variable), which better fits in our academic interests.
socioeconomically related variables
Age, Education, Income, AnyHealthcare, NoDocbcCost
health/personal variables
Within this class, we subdivided variables into two categories: (a)
lifestyle factors: BMI, Smoker, PhysActivity, Fruits, Veggies,
HvyAlcoholConsump (b) disease factors: Diabetes_012, HighBP, HighChol
(CholCheck related to this variable)
Other variables not included in these categories are considered irrelevant.
Based on the categorization, we decide to approach the exploration of the data set from two perspectives. Firstly, we study the relationship between Education/Income and stroke risk at socioeconomic level. Adding on to this point, we also select a series of variables from the health/personal class to investigate the correlation between these factors and stroke risk. After these two parts, we dive deep into the specific case of state of Georgia to look at the correlation between socioeconomic status and stroke incidence/mortality.
Socioeconomic factors, such as income and education level, can significantly impact an individual’s access to resources and thus their overall health. By investigating the association between socioeconomic status and stroke, we can gain insights into the underlying social determinants of stroke risks.
# median income and education level
med_income <- median(strokeDt$Income)
med_edu <- median(strokeDt$Education)
print(paste("The median for income level is",med_income,", and the median for education level is",med_edu))
## [1] "The median for income level is 7 , and the median for education level is 5"
# heat map of number of people in each group of income vs. education
summary_tibble %>%
ggplot(aes(Income, Education))+
geom_tile(aes(fill = num_ppl))+
scale_fill_gradient(low = "white",high = "black",labels = comma_format())+
labs(fill = "Number of people", title="Number of people in groups of different income and education levels")+
geom_text(aes(label = num_ppl), color = "darkblue", size = 3, fontface = "bold")+
theme(plot.title = element_text(hjust = 0.5))
The data set represents people from all the income levels ranging from annual income less than $10,000(1) to more than $75,000(8), and all the education levels ranging from “never attended school”(1) to “college graduate”(6). The relative weight of each group, however, is not equal. As indicated by the median income and education level (7 and 5, respectively), respondents in the survey tend to be well paid and educate. The same trend can be better visualized with the heat matrix, where sample population concentrates at the top-right quadrant. Therefore, the implication of the imbalanced group size distribution should be considered in the following analysis.
# bar plot for income vs stroke
income_stroke <-
strokeDt %>%
group_by(Income) %>%
summarise(
numPpl = n(),
Stroke_1 = sum(Stroke == 1),
Stroke_0 = sum(Stroke == 0),
Stroke_1_pct = Stroke_1/numPpl*100,
Stroke_0_pct = 100 - Stroke_1_pct)%>%
ungroup() %>%
select(Income,Stroke_1_pct,Stroke_0_pct) %>%
pivot_longer(-Income, names_to = "variable", values_to = "Percent") %>%
ggplot(aes(x = Income, y = Percent, fill = variable)) +
geom_bar(stat = "identity", position = "dodge", width = 1)+
scale_fill_manual(values = c("lightblue", "red"),labels=c('No Stroke', 'Has Stroke'))+
labs(fill = 'Stroke Status',y="Percentage")+
theme(plot.title = element_text(hjust = 0.5),legend.position = "None")
# bar plot for education vs stroke
edu_stroke <-
strokeDt %>%
group_by(Education) %>%
summarise(
numPpl = n(),
Stroke_1 = sum(Stroke == 1),
Stroke_0 = sum(Stroke == 0),
Stroke_1_pct = Stroke_1/numPpl*100,
Stroke_0_pct = 100 - Stroke_1_pct)%>%
ungroup() %>%
select(Education,Stroke_1_pct,Stroke_0_pct) %>%
pivot_longer(-Education, names_to = "variable", values_to = "Percent") %>%
ggplot(aes(x = Education, y = Percent, fill = variable)) +
geom_bar(stat = "identity", position = "dodge", width = 1)+
scale_fill_manual(values = c("lightblue", "red"),labels=c('No Stroke', 'Has Stroke'))+
labs(fill = 'Stroke Status',y="Percentage")+
theme(plot.title = element_text(hjust = 0.5),legend.position = "None")
# combine two plots together
combined_plot<-plot_grid(income_stroke, edu_stroke,ncol=2)
legend <- get_legend(
income_stroke +
guides(color = guide_legend(nrow = 1)) +
theme(legend.position = "bottom")
)
title <- ggdraw() +
draw_label(
"Stroke ratio distribution with respect to income/education levels",
x =0.5,
hjust = 0.5,
size = 16,
)
plot_grid(title,combined_plot,legend,ncol=1,rel_heights = c(0.1, 1))
After learning overall income and education situations of the sample respondents, we then investigate the association between these two socioeconomic variables and stroke risk. In the histograms above, the right-skewed distributions demonstrate that income and education level are associated with stroke risk respectively, where people ranked higher in these two levels are less likely to have stroke.
# distribution of stroke risk with respect to income and education levels
summary_tibble %>%
mutate(Income_Edu = paste(Income, Education, sep = ",")) %>%
select(Income_Edu,Stroke_1_pct,Stroke_0_pct) %>%
pivot_longer(-Income_Edu, names_to = "variable", values_to = "Percent") %>%
ggplot(aes(x = Income_Edu, y = Percent, fill = variable)) +
geom_bar(stat = "identity", position = "dodge", width = 1)+
scale_fill_manual(values = c("lightblue", "red"),labels=c('No Stroke', 'Has Stroke'))+
labs(fill = 'Stroke Status',x = "(Income level, Education level)",y="Percentage",title="Stroke ratio distribution with respect to combined income and educations levels")+
theme(plot.title = element_text(hjust = 0.5),axis.text.x = element_text(angle = 90, hjust = 1))
The same trend can be clearly presented in the histogram above where the independent variable is combined income and education level: within each income level, the ratio of experiencing stroke decreases with increased education level; and across all the income levels, the risk of stroke also decreases significantly with increased annual income.
Moving on from the socioeconomic view, we further investigate the association between health-related factors and stroke risk because many research results have indicated that daily life habits and other diseases can lead to higher probability of stroke.
strokeDt %>%
group_by(BMI) %>%
summarise(
numPpl = n(),
Stroke_1 = sum(Stroke == 1),
Stroke_0 = sum(Stroke == 0),
Stroke_1_pct = Stroke_1/numPpl*100,
Stroke_0_pct = 100 - Stroke_1_pct) %>%
select(BMI,Stroke_1_pct,Stroke_0_pct) %>%
ungroup() %>%
pivot_longer(-BMI, names_to = "variable", values_to = "Percent") %>%
ggplot(aes(x = BMI, y = Percent, color = variable)) +
geom_point()+
scale_color_manual(values = c("lightblue", "red"),labels=c('No Stroke', 'Has Stroke'))+
labs(color = 'Stroke Status', title = "BMI distribution with respect to stroke status",y="Percentage")+
theme(plot.title = element_text(hjust = 0.5))
BMI’s correlation with stroke risk is studied separately first since it is the only continuous variable among all the health-related factors. As indicated by the scatter plot above, proportion of people who experienced stroke distributes equally across group with different BMI. Thus, BIM is excluded from the following association analysis.
# refactor the Diabetes_012 values to represent both pre-diabetes and diabetes(2 originally) with 1
healthDt <- data.frame(strokeDt)
healthDt$Diabetes_012[healthDt$Diabetes_012==2] = 1
# drop responses where CholCheck == 0 (interviewee did not check cholesterol level within past five years)
healthDt <- healthDt[healthDt$CholCheck == 1, ]
# define adverse factors
adverse_disease <- c("Diabetes_012","HighBP","HighChol","HeartDiseaseorAttack")
adverse_lifestyle <- c("Smoker","HvyAlcoholConsump")
# visualize the results
pivoted_disease <-
sumTable_disease %>%
select(num_adverse_factors,Stroke_1_pct_disease) %>%
pivot_longer(-num_adverse_factors, names_to = "variable", values_to = "Percent") # reshape the table
pivoted_adv_lifestyle <-
sumTable_adv_lifestyle %>%
select(num_adverse_factors,Stroke_1_pct_lifestyle) %>%
pivot_longer(-num_adverse_factors, names_to = "variable", values_to = "Percent") # reshape the table
combinedDt_adver <- rbind(pivoted_disease,pivoted_adv_lifestyle) # combine two reshaped tables together
combinedDt_adver %>%
ggplot(aes(x=num_adverse_factors,y=Percent,color=variable))+
geom_line()+
geom_point(size=3)+
scale_color_manual(values = c("blue", "red"),labels=c('Disease', 'Adverse lifestyle'))+
labs(color = 'Type of adverse factors', title = "Association between number of adverse factors and stroke",x="Number of adverse factors",y="Percentage of stroke")+
theme(plot.title = element_text(hjust = 0.5))
Given the fact that most health related factors are “yes or no” questions with binary value, we choose total number of certain type of factors to which a respondent answers “yes” as the independent variable in our analysis. In this manner, the relationship between a series of factors and stroke can be examined in a single treatment.
The factors expected to have negative influence on one’s health, namely diseases and adverse life habits, are studied first. The result shows that chronic diseases - including high blood pressure, diabetes, and high cholesterol - and cardiovascular disease are significantly associated with high stroke risk. By contrast, adverse lifestyles such as smoking and heavy alcohol consumption do not exhibit clear correlation with stroke risk.
# define favorable factors
favorable_factors <- c("PhysActivity","Fruits","Veggies")
# Examine if there is association between number of favorable factors and probability of having stroke
healthDt %>%
select(Stroke,favorable_factors) %>%
mutate(num_favor_factors = PhysActivity + Fruits + Veggies) %>%
group_by(num_favor_factors) %>%
summarise(num_ppl = n(),
Stroke_1 = sum(Stroke==1),
Stroke_1_pct = Stroke_1/num_ppl*100) %>%
ungroup() %>%
ggplot(aes(x=num_favor_factors,y=Stroke_1_pct))+
geom_line(color="darkgreen")+
geom_point(size=3,color="darkgreen")+
labs(title = "Association between number of favorable factors and stroke",x="Number of favorable factors",y="Percentage of stroke")+
theme(plot.title = element_text(hjust = 0.5),legend.position="none")
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(favorable_factors)` instead of `favorable_factors` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
The relationship between factors favorable to one’s health and stroke risk are also studied for thorough understanding of stroke risk determinants. As the proportion of stroke decreases with total number of favorable factors, the result indicates that regular exercise and consumption of fruits and vegetables help to reduce the stroke risk.
Based on the associations discovered in the previous sections, a stroke prediction score is developed to assess a respondent’s risk of experiencing stroke. The score is calculated by subtracting total number of favorable factors from total number of adverse factors that a respondent have, and the distribution of the score is visualized in the box plot below.
healthDt %>%
mutate(score = Diabetes_012 + HighBP + HighChol + HeartDiseaseorAttack - (PhysActivity + Fruits + Veggies),
stroke_fct = as.factor(Stroke)) %>%
select(stroke_fct,score) %>%
ggplot(aes(x=stroke_fct,y=score,fill=stroke_fct))+
geom_boxplot(width=0.3)+
labs(title = "Distribution of stroke prediction score",x=" ",y="Score",fill="Stroke status")+
theme(plot.title = element_text(hjust = 0.5))+
scale_fill_manual(values = c("blue", "red"),labels=c('No stroke', 'Has stroke'))
Although the interquartile range are equal for both groups, the “has stroke” group score higher by 1 point overall than the “no stroke” group. This result is generally consistent with the associations between health related factors and stroke risk discovered previously.A comparison between the median score of both groups demonstrates that favorable factors may not neutralize the negative effect of adverse factors, but more favorable factors in general can help to decrease the stroke risk.
First and second part of this analysis can be potentially correlated because the socioeconomic status of a person can also influence their lifestyle and general health condition. Therefore, the underlying relationship behind these two aspects of the analysis should be further scrutinized, and this may become a topic for the research project.
Moving from the nationwide data, We choose to focus our attention on the state of Georgia for how socioeconomic factors can also be risk factors for stroke incidence. However, since stroke incidence data was lacking, we had to resort to using stroke mortality data text provided by the U.S. Department of Health and Human Services from 2000-2019. This data set provides county estimates of heart disease and stroke deaths by age group (ages 35-64 years, ages 65 year and older), race/ethnicity (American Indian/Alaskan Native, non-Hispanic Asian/Pacific Islander, non-Hispanic Black, Hispanic, non-Hispanic White), and sex (men, women). The estimates were provided using a Bayesion spatiotemporal model. Stroke mortality rates for this dataset were age-standardized in 10 year age groups using the 2010 U.S. population. We sampled data from the years 2013, 2014, and 2015 for our analysis.
Other datasets we considered were a FIPS code dataset text provided by GitHub to help with mapping the stroke mortality rates along with two other sets for per capita income and education attainment in Georgia. The income dataset was imported from the Federal Reserve Economic Data website text that contained per capita income estimates in Georgia per county. Education attainment data was difficult to find for all Georgian counties; the only data publicly available was for zip codes within the “Atlanta region”. This was joined with a dataset provided by Georgia Health Data text with a list of non-P.O. box zip codes and their respective counties. This was then added to the main data set used for this analysis - gadat_inc_edu. Because the education data set did not include every Georgian county, the final dataset contains “NA” values for those counties not included in the original data; it will be cleaned later on when education attainment and stroke mortality are inspected later in this analysis.
Below is the imported data with revisions to remove unnecessary data and clean up other variables.
#############################################
# IMPORTING AND REVISING STROKE MORTALITY DATA SET
# installing data set
getwd()
cardio_mort <- read_csv("Rates_and_Trends_in_Heart_Disease_and_Stroke_Mortality_Among_US_Adults__35___by_County__Age_Group__Race_Ethnicity__and_Sex___2000-2019.csv")
# filtering the data to select specific variables
cardio_mort1 <- cardio_mort %>%
filter(!is.na(Data_Value)) %>%
select(Year, LocationAbbr, LocationDesc, Class, Topic, Data_Value, Data_Value_Unit, Stratification1, Stratification2, Stratification3)
# renaming variables
new_names <- c(State = "LocationAbbr", County = "LocationDesc", Category = "Class", Age = "Stratification1", Race = "Stratification2", Gender = "Stratification3", Type = "Topic", Rate = "Data_Value", Unit = "Data_Value_Unit")
cardio_mort2 <- rename(cardio_mort1, all_of(new_names))
# Filtering data set for year, stroke, age, race, and gender (providing overall data)
years_mort <- c('2013','2014','2015')
stroke_mort_us <- cardio_mort2 %>%
filter(Year %in% years_mort,Type == "All stroke", Age == "Ages 35-64 years" || "Ages 65 years and older", Race == "Overall", Gender == "Overall")
names(stroke_mort_us) <- tolower(names(stroke_mort_us)) # set all column variable names to lower case
stroke_mort_us$county <- tolower(stroke_mort_us$county)
# Importing FIPS codes for US counties
fipsdat <- read_csv("fips-by-state.csv")
fipsdat$name <- word(fipsdat$name, 1)
fipsdat$name <- tolower(fipsdat$name)
fipsdat <- fipsdat %>%
rename(county = "name")
# left join to stroke_mort_us data set
stroke_mort_us <- left_join(stroke_mort_us, fipsdat, by = c("county", "state"), relationship = "many-to-many")
# Importing data for county income in GA. This will allow us to join to our main data set for GA and make comparisons between that and stroke mortality
inc <- read_csv("income.csv") %>%
select(county, income, year) # eliminates extra columns
inc$county <- tolower(inc$county) # county names to lowercase
inc$year <- as.character(inc$year)
inc$county <- replace(inc$county, inc$county == "de kalb", "dekalb")
stroke_mort_ga <- stroke_mort_us %>%
filter(state == "GA")
gadat_inc <- left_join(stroke_mort_ga, inc, by = c("county","year")) # join data sets
gadat_inc <- gadat_inc %>%
filter(state == "GA", year %in% c("2013","2014","2015"))
# Importing educational attainment dataset (5-year average from 2011-2015)
# This data set will be joined with the gadat_inc data set that was already created from the income and stroke_mort_us data sets
edu <- read_csv("edu_attainment.csv")
zip <- read_csv("zipcodes.csv") %>%
na.omit()
zip$zip <- as.double(zip$zip)
edu_county <- inner_join(edu,zip, by = c("ZIP" = "zip")) %>%
select(`County Name`, Less_than_HS_or_GED, BA_or_Higher, Percent_Less_than_HS_or_GED, Percent_BA_or_Higher)
edu_county$`County Name` <- tolower(edu_county$`County Name`)
gadat_inc_edu <- left_join(gadat_inc, edu_county, by = c("county" = "County Name"), relationship = "many-to-many")
Below are the first few observations of the finalized dataset.
knitr::kable(head(gadat_inc_edu), col.names = gsub("[.]", " ", names(gadat_inc_edu)))
| year | state | county | category | type | rate | unit | age | race | gender | fips | income | Less_than_HS_or_GED | BA_or_Higher | Percent_Less_than_HS_or_GED | Percent_BA_or_Higher |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013 | GA | appling | Cardiovascular Diseases | All stroke | 29.2 | per 100,000 | Ages 35-64 years | Overall | Overall | 13001 | 28353 | NA | NA | NA | NA |
| 2014 | GA | appling | Cardiovascular Diseases | All stroke | 29.7 | per 100,000 | Ages 35-64 years | Overall | Overall | 13001 | 29185 | NA | NA | NA | NA |
| 2015 | GA | appling | Cardiovascular Diseases | All stroke | 31.8 | per 100,000 | Ages 35-64 years | Overall | Overall | 13001 | 31414 | NA | NA | NA | NA |
| 2014 | GA | appling | Cardiovascular Diseases | All stroke | 368.5 | per 100,000 | Ages 65 years and older | Overall | Overall | 13001 | 29185 | NA | NA | NA | NA |
| 2015 | GA | appling | Cardiovascular Diseases | All stroke | 355.1 | per 100,000 | Ages 65 years and older | Overall | Overall | 13001 | 31414 | NA | NA | NA | NA |
| 2013 | GA | appling | Cardiovascular Diseases | All stroke | 337.3 | per 100,000 | Ages 65 years and older | Overall | Overall | 13001 | 28353 | NA | NA | NA | NA |
Our initial thoughts led us to attempt to provide a country-wide view of stroke mortality cases across the United States. Since our original dataset did not provide geographic information, other than the fact that it was collected across all 50 states (including U.S. territories), it was most reasonable to have the stroke mortality data provide a more spatial perspective of stroke mortalities across the U.S. However, as mentioned above, we decided to us the mortality cases to provide a more focused look at stroke within a particular region; in this case, a single state - Georgia. Below are three heat maps depicting stroke mortality cases across the country for 2013-15.
## MAPS FOR STROKE MORTALITY RATES
## FIRST MAP (2013 MORTALITY RATES): This map includes the mortality rates per county in the US from 2013. Note that all categories for age, race, and gender are included in the study
# Creating altered data set for 2013 year
stroke_mort_us_2013 <- stroke_mort_us %>%
filter(year == "2013") %>%
select(state, county, rate, fips) %>%
group_by(county, state, fips) %>%
summarize(avg_rate = mean(rate))
## `summarise()` has grouped output by 'county', 'state'. You can override using
## the `.groups` argument.
# MAP OF STROKE MORTALITIES PER COUNTY 2013: This is a non-interactive map of the stroke mortalities per county in the US in 2013
plot_usmap(data = stroke_mort_us_2013, values = "avg_rate", regions = "counties",color = "black") +
labs(title = "Stroke Mortality Rate (per 100k) within the United States, 2013") +
scale_fill_continuous(low = "white", high = "red", name = "Rate", labels = scales::comma) +
theme(plot.background = element_rect(color = "black",fill = "lightblue"), legend.position = "right")
## Warning: Ignoring unknown parameters: linewidth
# 2ND MAP (2014 MORTALITY RATES): This map includes the mortality rates per county in 2014
# Creating altered data set for 2014
stroke_mort_us_2014 <- stroke_mort_us %>%
filter(year == "2014") %>%
select(state, county, rate, fips) %>%
group_by(county, state, fips) %>%
summarize(avg_rate = mean(rate))
## `summarise()` has grouped output by 'county', 'state'. You can override using
## the `.groups` argument.
# MAP OF STROKE MORTALITIES PER COUNTY 2014: This is a non-interactive map of the stroke mortalities per county in the US in 2014
plot_usmap(data = stroke_mort_us_2014, values = "avg_rate", regions = "counties",color = "black") +
labs(title = "Stroke Mortality Rate (per 100k) within the United States, 2014") +
scale_fill_continuous(low = "white", high = "red", name = "Rate", labels = scales::comma) +
theme(plot.background = element_rect(color = "black", fill = "lightblue"), legend.position = "right")
## Warning: Ignoring unknown parameters: linewidth
# 3RD MAP (2015 MORTALITY RATES): This map include the mortality rates per county in 2015
# Creating altered data set for 2015
stroke_mort_us_2015 <- stroke_mort_us %>%
filter(year == "2015") %>%
select(state, county, rate, fips) %>%
group_by(county, state, fips) %>%
summarize(avg_rate = mean(rate))
## `summarise()` has grouped output by 'county', 'state'. You can override using
## the `.groups` argument.
# MAP OF STROKE MORTALITIES PER COUNTY 2015: This is a non-interactive map of the stroke mortalities per county in the US in 2015
plot_usmap(data = stroke_mort_us_2015, values = "avg_rate", regions = "counties",color = "black") +
labs(title = "Stroke Mortality Rate (per 100k) within the United States, 2015") +
scale_fill_continuous(low = "white", high = "red", name = "Rate", labels = scales::comma) +
theme(plot.background = element_rect(color = "black",fill = "lightblue"), legend.position = "right")
## Warning: Ignoring unknown parameters: linewidth
The grey counties represent those in which there is insufficient data. The lack of data in some counties further influenced our decision to focus our analysis on Georgian counties.
We choose to create heat maps divided by counties for Georgia displaying the rates of stroke mortality. We assumed this would help us observe any trends in stroke mortality rates across the state from 2013 to 2015.
# Importing county list (including longitude, latitude, group, and order numbers)
# alterations: only selected counties in Georgia, matched abbreviations to state names, and correcting a mistake in the county_df data
county_df <- map_data("county") %>%
filter(region == "georgia")
county_df$region <- state.abb[match(county_df$region, tolower(state.name))] # match abbreviations to state names
county_df$subregion <- replace(county_df$subregion, county_df$subregion == "de kalb", "dekalb")
# Creating gadat_inc_edu by joining stroke_mort_us and county_df an filtering by state and year
gadat_inc_edu <- left_join(gadat_inc_edu, county_df, by = c("county" = "subregion", "state" = "region"), relationship = "many-to-many") %>%
filter(state == "GA")
## [[1]]
##
## $title
## [1] "Stroke Mortality Rate (per 100k) in Georgia, 2013"
##
## attr(,"class")
## [1] "labels"
mapdat_ga_2014 <- gadat_inc_edu %>%
filter(year == "2014")
p2014 <- ggplot(mapdat_ga_2014, aes(x = long, y=lat, group=group, ids=county)) +
geom_polygon(aes(fill = rate),
color = alpha("black", 0.5)) +
scale_fill_gradient2(low = "white", high = "#FF2E00", name = "rate", limits = c(0,500)) +
theme_void() +
coord_quickmap()
ggplotly(p2014) %>%
labs(title = "Stroke Mortality Rate (per 100k) in Georgia, 2014")
## [[1]]
##
## $title
## [1] "Stroke Mortality Rate (per 100k) in Georgia, 2014"
##
## attr(,"class")
## [1] "labels"
mapdat_ga_2015 <- gadat_inc_edu %>%
filter(year == "2015")
p2015 <- ggplot(mapdat_ga_2015, aes(x = long, y=lat, group=group, ids=county)) +
geom_polygon(aes(fill = rate),
color = alpha("black", 0.5)) +
scale_fill_gradient2(low = "white", high = "#FF2E00", name = "rate", limits = c(0,500)) +
theme_void() +
coord_quickmap()
ggplotly(p2015) %>%
layout(title = "Stroke Mortality Rate (per 100k) in Georgia, 2015")
This map assisted with representing stroke mortality rates visually, but each map looked too similar to each. For this reason, we deemed it more useful to measure the percent change in mortality rate. The first time interval from 2013-14 and the second from 2014-15.
# Because the estimates for stroke mortality in each county appear very similar to each other, we move to observe trend in percentage change for stroke mortality
ga_perchan_2013_2014 <- gadat_inc_edu %>%
select(year, state, county, rate) %>%
filter(year %in% c("2013","2014")) %>%
group_by(state,county, year) %>%
summarize(avg_mort = mean(rate)) %>%
mutate(per_change = (avg_mort/lag(avg_mort) - 1) * 100) %>%
filter(year == 2014)
## `summarise()` has grouped output by 'state', 'county'. You can override using
## the `.groups` argument.
# joining ga_perchan_2013_2014 tibble with county_df data to map percent changes onto map of Georgia
names(ga_perchan_2013_2014) <- tolower(names(ga_perchan_2013_2014))
ga_perchan_2013_2014$county <- tolower(ga_perchan_2013_2014$county)
ga_perchan_2013_2014_2 <- left_join(ga_perchan_2013_2014, county_df, by = c("county" = "subregion", "state" = "region"), relationship = "many-to-many")
#Producing graph: this process is the same as the graphs for displaying the stroke mortality rates for Georgia above
pga_perchan_2013_2014 <- ggplot(ga_perchan_2013_2014_2, aes(x = long, y=lat, group=group, ids=county)) +
geom_polygon(aes(fill = per_change),
color = alpha("black", 0.5)) +
scale_fill_gradient2(low = "green", high = "red", name = "% change", limits = c(-7.0,17.0), breaks = c(-7.0,-3.5,0,3.5,7.0,10.0,13.5,17.0)) +
theme_void() +
coord_quickmap()
ggplotly(pga_perchan_2013_2014) %>%
layout(title = "Percent Change in Stroke Mortality Rates (per 100k) from 2013-14")
# The same process is repeated below, only for the 2014-2015 years
ga_perchan_2014_2015 <- gadat_inc_edu %>%
select(year, state, county, rate) %>%
filter(year %in% c("2014","2015"), state == "GA") %>%
group_by(state,county, year) %>%
summarize(avg_mort = mean(rate)) %>%
mutate(per_change = (avg_mort/lag(avg_mort) - 1) * 100) %>%
filter(year == 2015)
## `summarise()` has grouped output by 'state', 'county'. You can override using
## the `.groups` argument.
# joining
names(ga_perchan_2014_2015) <- tolower(names(ga_perchan_2014_2015))
ga_perchan_2014_2015$county <- tolower(ga_perchan_2014_2015$county)
ga_perchan_2014_2015_2 <- left_join(ga_perchan_2014_2015, county_df, by = c("county" = "subregion", "state" = "region"), relationship = "many-to-many")
# creating interactive map
pga_perchan_2014_2015 <- ggplot(ga_perchan_2014_2015_2, aes(x = long, y=lat, group=group, ids=county)) +
geom_polygon(aes(fill = per_change),
color = alpha("black", 0.5)) +
scale_fill_gradient2(low = "green", high = "red", name = "% change", limits = c(-7.0,17.0), breaks = c(-7.0,-3.5,0,3.5,7.0,10.0,13.5,17.0)) +
theme_void() +
coord_quickmap()
ggplotly(pga_perchan_2014_2015) %>%
layout(title = "Percent Change in Stroke Mortality Rates (per 100k) from 2014-15")
# Creating bar chart of percent changes over the three-year span: This visual will make the percent changes more easily interpreted so that we can focus on the
ga_perchan_2013_2014_bar <- ga_perchan_2013_2014 %>%
select(county, per_change) %>%
mutate(years = "2013-2014")
## Adding missing grouping variables: `state`
ga_perchan_2014_2015_bar <- ga_perchan_2014_2015 %>%
select(county, per_change) %>%
mutate(years = "2014-2015")
## Adding missing grouping variables: `state`
ga_perchan_all_years <- union(ga_perchan_2013_2014_bar, ga_perchan_2014_2015_bar) # combine data into single tibble
ggplot(ga_perchan_all_years, aes(x = per_change, fill = years)) +
geom_density(aes(color = years), alpha =0.6) +
labs(title = "Density Plots for Percent Changes in Stroke Mortality Rates from 2013-4 and 2014-15", x = "percent change (%)")
From the two maps, there was an increase in the percent change of stroke mortalities from the first to the 2nd interval in the northern regions of Georgia where the southern regions experience a decrease from the first to the second interval. Specifically, in the Atlanta metropolitan area, counties experienced a percent increase in stroke mortality rates. Among the counties with the greatest increase in percent change was Cobb county, with a 3.18% change from 2013-14 and a 16.93% from 2014-15. The rest of north Georgia experienced a similar trend. In southeast Georgia, Appling county was among the counties with the greatest decrease in stroke mortality, with an 8.65% change from 2013-14 to a -2.84% change from 2014-15. Other counties in southeast and east experienced a similar trend. To display these percent changes across the two intervals a density plots provided a clear understanding for how stroke mortality varied. Both density plots appear right-skewed, with most counties in the 2013-14 interval showing an overall negative percent change in stroke mortality while more counties in the 2015-16 interval show a positive percent change.
Based on the findings from the initial dataset, the per capita income for all Georgian counties were compared to stroke mortality rates. Age was also factored into the plot due to the disparity in stroke mortality rates between the younger and older populations. Mortality rates for all three years were taken.
# Creating scatter plot to display trend between income and absolute number of stroke mortality cases
ggplot(gadat_inc_edu, aes(x=income, y=rate, color = year, shape = age))+
geom_point() +
ggtitle("Rate of Stroke Mortality vs Per Capita Income for all counties in Georgia, 2013-15") +
xlab("average per capita income") +
ylab("stroke mortality rate (per 100,000)")
A majority of the data is concentrated near the lower end of the per capita income scale with a large disparity in stroke mortality rates between the two age groups (as expected). Observing closely, there is a slight increase in mortality rates with how the data is concentrated according to the year, from 2013 to 2015. For each cluster of data according to each age group. There appears to be a slight upward shift in the data, which may be indicative of a correlation between stroke mortality and per capita income as previously inspected. Data for 2015 appears to be concentrated around a higher stroke mortality rate overall compared to previous years across both age groups. A noticeable characteristic is that the older age group (top) shows significantly more variability in stroke mortality than the younger age group (bottom). As income level increases, this variability appears to decrease. This is not easily noticeable for the younger age group, but the 65+ years old age group shows a significant decrease in variability as per capita income level increases and approaches the 275 per 100,000 mortality rate, roughly. Further inquiry into these characteristics of the graph could be studied to either verify or deny if there is a correlation at all between income level and stroke mortality; however, an approach like this may lead us to focus on income levels for counties across the U.S. instead of those within Georgia as this trend (if there is any) may/may not be specific to its counties.
Educational Attainment was grouped according to people with less than a high school degree or GED and those with a bachelor’s of arts or higher. Because this data was limited, the only counties provided with this data were those centered around the Atlanta area: 1. Barrow 2. Bartow 3. Carroll 4. Cherokee 5. Clayton 6. Cobb 7. Coweta 8. Dekalb 9. Douglas 10. Fayette 11. Forsyth 12. Fulton 13. Gwinnett 14. Hall 15. Haralson 16. Henry 17. Newton 18. Paulding 19. Rockdale 20. Spalding 21. Walton
# Creating scatter plot to display trend between education level attainment based on few counties (based on zip codes) within the Atlanta region
# filtering data
gadat_inc_edu1 <- gadat_inc_edu %>% # exclude NAs and select year
na.omit() %>%
filter(year == 2015)
gadat_inc_edu1
gadat_inc_edu2 <- gadat_inc_edu1 %>% # restructure columns
pivot_longer(13:14, names_to = "degree", values_to = "num_pop") #rearrange columns for graph
gadat_inc_edu2
# plotting data
ggplot(gadat_inc_edu2, aes(x=num_pop, y = rate, color = degree, shape = age))+
geom_jitter() +
ggtitle("Rate of Stroke Mortality vs. Educational Attainment") +
xlab("number of people") +
ylab("stroke mortality rate (per 100,000)")
The two education levels do not differentiate from each other in the plot. Those with less than a high school degree or GED do not portray any trend different than from what the group with Bachelor’s degrees or greater did. Since this was also grouped by age group, both did not reveal any noticeable difference between education attainment. There is the same characteristic of high variance in the older population compared to the younger one as preciously discussed. Futher inquiry into education may not be the optimal variable to pursue in finding socioeconomic risk factors associated with stroke mortality.
Socioeconomic factors may play a larger role in stroke mortality rates across the country. We think that the income level will provide the most promising conclusions based on trends presented in this file. Practically, income would play a role in the health of an individual. Income affects whether or not the person can afford medical costs, treatments, or a healthy lifestyle.